The subway is a growing means of transportation that can move passengers safely and efficiently. However, as ridership increases, congestion has become one of the main factors that lower the efficiency of subway operations.
I use the subway a lot, both in China and in the U.S. Every time I was stuck in a crowd moving slowly towards the subway gates, or squeezed into a completely packed subway car, I thought: “If I had known how packed the station was, I would have kept shopping and waited until it got less congested.”
Congestion prediction would therefore be a useful way to improve convenience and comfort for metro users. Because the MTA subway in New York City ranks among the world’s busiest subway systems by annual ridership, and the MTA publishes readily available turnstile data, I chose to base my analysis on the MTA subway.
This project includes three main parts:
- Build an OLS regression model for forecasting the hourly net entries of metro users at each MTA subway station in NYC.
- Define the congestion level of each station based on prediction results. Other factors to consider include number of turnstiles, number of subway lines running at the station, number of entrances, etc.
- Develop a web-based user interface to inform metro users about the predicted future congestion level at each MTA station at the time they specify.
In this section, I will explain the methods I use to build the regression model, compute the congestion score and create the web application.
For the regression analysis, I prepare a dataset of net entries of metro users per station in July and August 2017. Net entries are the dependent variable in the OLS regression model. The raw turnstile usage dataset I use to prepare the dependent variable is provided by the New York State open data platform. One big limitation of the dataset is that entries and exits are recorded every 4 hours rather than hourly, so the dependent variable is net entries per 4-hour interval.
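Since the raw turnstile data stores cumulative counters, the per-interval counts have to be recovered by differencing consecutive readings. Below is a minimal Python sketch of that step (the project itself uses R). The function name, the tuple layout and the rule of skipping negative differences as counter resets are assumptions for illustration, not the exact procedure used in this project.

```python
from collections import defaultdict


def interval_counts(readings):
    """Turn cumulative counter readings into per-interval counts.

    readings: iterable of (turnstile_id, timestamp, cum_entries, cum_exits),
    where timestamp sorts chronologically (e.g. "2017-07-01 04:00").
    Returns a list of (turnstile_id, timestamp, entries, exits), where each
    count is the increase since that turnstile's previous reading.
    """
    by_turnstile = defaultdict(list)
    for turnstile_id, ts, cum_in, cum_out in readings:
        by_turnstile[turnstile_id].append((ts, cum_in, cum_out))

    result = []
    for turnstile_id, rows in by_turnstile.items():
        rows.sort()  # chronological order within one turnstile
        for (_, prev_in, prev_out), (ts, cur_in, cur_out) in zip(rows, rows[1:]):
            d_in, d_out = cur_in - prev_in, cur_out - prev_out
            # Assumption: a negative difference means the counter was reset,
            # so that interval is skipped rather than guessed.
            if d_in < 0 or d_out < 0:
                continue
            result.append((turnstile_id, ts, d_in, d_out))
    return result
```

The per-turnstile counts can then be summed per station and time interval before deriving the dependent variable.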
To predict net entries, I consider five categories of predictors: time variables, spatial variables, census variables, weather variables and trips made by other transportation modes.
For time variables, I consider factors such as day of the week, day of the month, rush hour and holiday. Furthermore, I speculate that net entries are temporally autocorrelated. That is, the net entries at a station in period n-1 may influence the net entries at that station in period n, so I also include lagged net entries as a time variable. Another predictor I include in the model is the count of public events. Public events could drive people to go out, and the subway would be one of their major transportation choices.
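The lagged predictor can be built by pairing each observation with the previous observation at the same station. The following is a minimal Python sketch (the project itself uses R). The field names `station`, `time`, `net_entries` and `lag_net_entries` are hypothetical, and the sketch assumes there are no missing intervals within a station.

```python
def add_lag(rows):
    """Attach the previous interval's net entries at the same station.

    rows: list of dicts with keys 'station', 'time' and 'net_entries'.
    Returns a new list sorted by station and time, where each dict gains
    'lag_net_entries' (None for the first observation of a station).
    """
    ordered = sorted(rows, key=lambda r: (r["station"], r["time"]))
    out = []
    prev_station, prev_value = None, None
    for r in ordered:
        # Only carry the previous value over within the same station.
        lag = prev_value if r["station"] == prev_station else None
        out.append({**r, "lag_net_entries": lag})
        prev_station, prev_value = r["station"], r["net_entries"]
    return out
```

Rows whose lag is `None` would have to be dropped before fitting, since OLS cannot use observations with a missing predictor.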
For spatial variables, I calculate the distance between each subway station and its nearest public facilities, such as bus stops, schools, plazas and malls, offices, hospitals and parking lots, as well as the nearest other subway station.
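A nearest-facility distance of this kind can be sketched in Python as follows (the project itself uses R). The sketch assumes that stations and facilities are given as latitude/longitude pairs and that great-circle (haversine) distance is an acceptable measure; the actual analysis may have used projected coordinates and a different distance method. Function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_distance_m(station, facilities):
    """Distance from one station (lat, lon) to its closest facility."""
    return min(haversine_m(station[0], station[1], lat, lon) for lat, lon in facilities)
```

Calling `nearest_distance_m` once per facility type (hospitals, schools, bus stops and so on) would yield one distance column per type, such as `d_hospital` or `d_school`.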
For census variables, I choose census block group demographics from the U.S. Census Bureau, such as population density, total housing units and race. I also include ESRI demographic metrics such as businesses and percentage of development.
Moreover, I expect people to use the subway more on a sunny day than on a rainy day, and more on a warm day than on a chilly day. Hence, I include weather data as predictors in the regression model.
Lastly, other transportation modes give people alternatives for transit and may affect net entries. Thus, the model also includes taxi trip counts and bike trip counts as independent variables for predicting net entries every 4 hours at each station. Because of limited data availability, I do not include bus trip counts.
Here is the outline of the OLS regression model:
Dependent variable:
Net Entries of Metro Users Every 4 Hours at Each Station in NYC (NetEntries)
Predictors:
- Time lag (lag_NEntries)
- Day of the week (Weekday)
- Day of the month (Day)
- Rush hour or not (Rush_Hour)
- Holiday or not (Holiday)
- Event count (Event_Count & Event_Count2)
- Distance to hospital (d_hospital)
- Distance to college (d_college)
- Distance to nearest subway stations (d_subway)
- Distance to parks (d_parks)
- Distance to CBDs (d_cbd)
- Distance to bus stops (d_busstop)
- Distance to public schools (d_school)
- Distance to plaza and malls (d_plazamalls)
- Distance to recreation facilities (d_recreation)
- Distance to parking lots (d_parkinglot)
- Distance to offices (d_office)
- Total Housing Unit (TOTHU_CY)
- Population Density (POPDENS_CY)
- Median Household Income (MEDHINC_CY)
- Businesses (S01_BUS)
- Percentage of Development (HISPPOP_CY)
- Race (WHITE_CY, BLACK_CY, AMERIND_CY, ASIAN_CY, PACIFIC_CY, OTHRACE_CY)
- Temperature (TEMP_C)
- Humidity (HUMIDITY)
- Pressure (PRESSURE)
- Wind speed (WIND_SPEED)
- Wind direction (WIND_DIR)
- Weather description (WEATHER)
- Taxi trips (TaxiTrips)
- Bike trips (BikeTrips)
Before running the OLS regression with the predictors listed above, I conduct some exploratory analysis to better understand the temporal and spatial patterns of subway net entries and to decide which variables to use in the final model.
The OLS regression analysis includes three parts: in-sample regression, out-of-sample regression and cross-validation. I first run the regression on the full dataset as the in-sample regression. After that, I randomly select 75% of the net entries observations as the training set and use the other 25% as the test set to run the out-of-sample regression. Finally, I validate my regression model with a 20-fold cross-validation, which shows how well the model generalizes.
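The split-and-validate procedure can be illustrated with a minimal Python sketch (the project itself uses R). To stay self-contained it fits a single-predictor OLS line rather than the full model, and it uses mean absolute error as the evaluation metric; both choices, along with all function names, are assumptions for illustration.

```python
import random


def fit_ols(xs, ys):
    """Fit y = a + b * x by ordinary least squares; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mae(model, xs, ys):
    """Mean absolute error of the fitted line on the given points."""
    a, b = model
    return sum(abs(y - (a + b * x)) for x, y in zip(xs, ys)) / len(xs)


def train_test_split(n, train_share=0.75, seed=0):
    """Randomly split indices 0..n-1 into training and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_share)
    return idx[:cut], idx[cut:]


def k_fold_mae(xs, ys, k=20, seed=0):
    """Return the held-out MAE of each of k cross-validation folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_ols([xs[i] for i in train], [ys[i] for i in train])
        errors.append(mae(model, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return errors
```

Fold errors that stay close to each other and to the test-set error suggest that the model is not overly dependent on which observations it was trained on.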
In the next section, Installation, I explain in detail how I conduct both the exploratory and the regression analysis to build the final OLS model.
The main result of the finished OLS regression model is predicted net entries every 4 hours at each MTA subway station in NYC in July and August 2017. This prediction result is used as the main factor to compute the congestion score.
Other factors I consider when computing the congestion score include weekday or weekend, rush hour or not, number of turnstiles, number of subway lines running at each station, and number of station entrances. I perform additional calculations to obtain the following 8 metrics for the congestion score: Weekday vs. Weekend (s_weekday), Rush hour vs. Not rush hour (s_rushhr), Predicted net entries (s_pred), Net entries per turnstile (s_entTrnst), Net entries per entrance (s_entEnt), Net entries per subway line (s_entLine), Turnstiles per entrance (s_trnstEnt) and Subway lines per entrance (s_lineEnt). Each metric is divided into 5 quantile classes and assigned a score, with 5 meaning most likely to be congested and 1 meaning least likely to be congested. I then rank these 8 metrics from most important to least important and give each a weight from 5 to 1, with 5 meaning most important and 1 meaning least important; metrics of similar importance share a weight. Finally, I compute the congestion score using the following equation:
Congestion Score =
5 × s_entLine + 4 × s_weekday + 3 × (s_entTrnst + s_entEnt) + 2 × (s_trnstEnt + s_lineEnt + s_rushhr) + s_pred
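The scoring step and the weighted sum can be sketched in Python as follows (the project itself uses R). The weights are taken from the equation above. The rank-based quintile rule, the convention that a larger raw value earns a higher score, and the function names are assumptions; the direction of each metric and the treatment of ties are not specified here.

```python
# Weights copied from the congestion score equation.
WEIGHTS = {
    "s_entLine": 5,
    "s_weekday": 4,
    "s_entTrnst": 3,
    "s_entEnt": 3,
    "s_trnstEnt": 2,
    "s_lineEnt": 2,
    "s_rushhr": 2,
    "s_pred": 1,
}


def quintile_scores(values):
    """Score each value 1..5 by rank-based quintile (larger value, higher score).

    Assumption: ties are broken by input order, so equal values can land in
    adjacent quintiles.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0] * n
    for rank, i in enumerate(order):
        scores[i] = rank * 5 // n + 1
    return scores


def congestion_score(metric_scores):
    """Weighted sum of the eight 1..5 metric scores for one observation."""
    return sum(weight * metric_scores[name] for name, weight in WEIGHTS.items())
```

With these weights the score ranges from 22 (every metric scored 1) to 110 (every metric scored 5).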
Finally, I divide the computed congestion scores into 5 levels by percentile: the first 10% of scores have a congestion level of 1, 10% - 40% have a congestion level of 2, 40% - 60% have a congestion level of 3, 60% - 90% have a congestion level of 4 and 90% - 100% have a congestion level of 5. When the congestion level is 5, metro users can safely consider the station to be congested at the time they specify, and can look for an alternative mode of transit or take the subway at a different time.
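The mapping from scores to levels can be sketched in Python as follows (the project itself uses R). It assumes that "first 10%" refers to the lowest scores, which matches level 5 being the most congested, and that each upper bound is inclusive; the function name and the tie handling are also assumptions.

```python
def congestion_levels(scores):
    """Map congestion scores to levels 1..5 with breaks at 10/40/60/90 percent.

    Assumption: percentiles are rank-based, counted from the lowest score,
    and ties are broken by input order.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    levels = [0] * n
    for rank, i in enumerate(order):
        position = (rank + 1) * 100  # compared against percent * n, integers only
        if position <= 10 * n:
            level = 1
        elif position <= 40 * n:
            level = 2
        elif position <= 60 * n:
            level = 3
        elif position <= 90 * n:
            level = 4
        else:
            level = 5
        levels[i] = level
    return levels
```

Because the breaks are uneven, levels 2 and 4 each cover 30% of observations, while levels 1 and 5 are reserved for the extreme 10% at either end.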
The interactive web application is created with the Shiny package in R. The application allows metro users to check how congested an MTA subway station will be at a specified time.
The dataset I use to build the web app contains predicted net entries and congestion scores for all MTA stations in July and August 2017. The web app serves as a prototype for building a more practical congestion forecast web app in the future.
Please click here to access my published MTA subway congestion forecast web app. It may take several seconds to load since the dataset has over 100,000 rows of data. A more detailed explanation of how to use the web app is provided in a later section.
In this section, a detailed step-by-step installation procedure is presented to show how I conduct my analysis.
The first step in building this OLS regression model is data gathering. Below is a list of the raw data I collected, with a description of where each dataset comes from and what information it contains.
1. Turnstile Usage Data 2017
This CSV dataset is downloaded from the New York State open data platform. The dataset contains cumulative entry and exit counts for each turnstile unit at each MTA station. The image below shows what the raw dataset looks like.
2. MTA Subway Entrance and Exit Data
This CSV dataset is also downloaded from the New York State open data platform. The data contains entrance and exit information for all MTA stations, as well as which subway lines run at each station.
3. Weather Data
The weather data is obtained from Kaggle. The data is posted by user Selfish Gene. The data contains 5 years of hourly data of various weather attributes, such as temperature, humidity, air pressure, etc, and is available for 30 US and Canadian Cities (including New York), as well as 6 Israeli cities. The data is acquired using Weather API on the OpenWeatherMap website, and is available under the ODbL License.
4. School Locations
The shapefile is obtained from NYC Open Data. The data contains school point locations based on the official address.
5. Bus Stop Locations
The shapefile is obtained from The William and Anita Newman Library. The data contains bus stop locations in NYC.
6. Park Locations
The shapefile is downloaded from NYC Open Data. The data contains a polygon layer displaying all parks in NYC.
7. Facilities Database
The shapefile is also downloaded from NYC Open Data. The data contains information about 35,000+ public and private facilities and program sites that are owned, operated, funded, licensed or certified by a City, State, or Federal agency in the City of New York.
8. NYC Census Data
I obtain census data at the census block group level from ArcGIS Online. After logging in, I locate New York, select “Analysis” and “Layer Enrichment”, and add population density, total housing units, median household income and other needed census variables at the USA census block group level to the map layer. Then I export the data as a shapefile for import into R.